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Abstract 

We present and discuss new importance sampling schemes for the approximate compu- 
tation of the sample probability of observed genetic types in the infinitely many sites model 
from population genetics. More specifically, we extend the 'classical framework', where ge- 
nealogies are assumed to be governed by Kingman's coalescent, to the more general class 
of Lambda-coalescents and develop further Hobolth et. al.'s (2008) idea of deriving impor- 
tance sampling schemes based on 'compressed genetrees'. The resulting schemes extend 
earlier work by Griffiths and Tavare (1994), Stephens and Donnelly (2000), Birkner and 
Blath (2008) and Hobolth et. al. (2008). We conclude with a performance comparison of 
classical and new schemes for Beta- and Kingman coalescents. 
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1 Introduction 

1.1 Aims and outline of the paper 

In the present paper we derive and discuss importance samphng schemes for the approximate 
computation of the samphng probability of observed genetic types in the infinitely many sites 
model (ISM), which is used for the analysis of DNA sequence data sampled from a population. 

In particular, we extend earlier results on this classical problem of likelihood estimation in 
mathematical genetics in two directions. 

First, we consider genealogies which may be governed by any member of the rather gen- 
eral class of Lambda-coalescents instead of restricting to the classical Kingman's coalescent 
framework only. These genealogies offer more flexibility in the modelling of 'exceptional ge- 
nealogical events' like extreme reproduction and selective sweeps, see e.g. |BB08j for a brief 
discussion. In particular, we derive the analogues of the 'Kingman-scenario' based importance 
sampling schemes of Griffiths and Tavare |GT94] . Stephens and Donnelly [SDOOj and Hobolth, 
Uyenoyama and Wiuf |HUW08j . 
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For the second direction of our investigation, observe that both the schemes derived by 
Ethier and Griffiths and Stephens and Donnelly do not take any specific information about 
the genealogical distance of types (which is provided by the infinitely many sites model) into 
account. Indeed, the latter proposal has been explicitly derived by means of optimality for 
parent-independent mutation models which in particular do not provide information about ge- 
nealogical distance. Hobolth et. al. [HUWOS] proposed a scheme which can be regarded as 
a starting point to overcome this simplification. Indeed, for their proposal distribution, they 
'compress' the observed genealogical tree to a tree where only one segregating site remains, 
derive optimal proposals for this compressed tree, and show how to combine them to obtain a 
proposal for the original tree. We show how to extend this method to compressed trees with 
two (and, in principle more) segregating sites, which retain information about the topology of 
the original tree and the genealogical distance of the types of the sample, leading to further 
improved importance sampling schemes (also in the Lambda-coalescent scenario). We 'pay' for 
this additional genealogical information with an increase of complexity in the derivation and of 
the proposal scheme. Along the way, we discuss the optimality of the analogue of the Stephens 
and Donnelly proposal for the Lambda-coalescent in the infinitely many alleles model. 

The paper is organised as follows. In Section 11.21 we discuss in detail the combinatorial 
framework of samples in the infinitely many sites model. In Section 11.31 we formulate various 
recursions which form the basis of our importance sampling schemes. Section [2. II and Section [2^2] 
discuss the creation of sample histories and the basic framework for importance sampling. We 
will also briefly discuss the notion of optimality. In Section [2]3] we extend earlier and derive new 
important sampling schemes, whose performance we will analyse in Section El In the appendix, 
we will provide an algorithm for generating sample histories (jA.ip . derive some auxiliary results 
on the combinatorics of the infinitely many sites model (|A.2p and briefiy discuss computational 
aspects ()A.3p as well as estimation of event times, given the observed data ()A.4p . 

1.2 Genealogies and samples in the infinitely many sites model 

We consider samples taken from a large panmictic population of constant size evolving due 
to random mating and mutation according to the infinitely many sites model. We study the 
distribution of (neutral) genetic variation at a single locus and may therefore assume that the 
genealogy of the sampled genes is described by an exchangeable coalescent process. Extending 
the classical framework of [ EG87j . we consider in particular genealogies governed by so-called 
Lambda-coalescents, hence allowing multiple, but not simultaneous multiple collisions. 

Recall that Pitman ( |P99] ) and Sagitov ( |S99] ) introduced and discussed coalescents in which 
more than just two blocks may merge at a time. Informally, a Lambda-coalescent is a partition- 
valued Markov process, whose dynamics is as follows: Whenever there are 6 G N blocks in the 
partition at present, each A;-tuple of blocks (where 2 < k <b < n) merges to form a single block 
at rate Xb^k, where the rates are given by 





for some finite measure A on the unit interval. 



Further, denote by 



b 




(1.2) 



k=2 



the total rate at which mergers happen while there are b blocks present. 
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Note that the family of Lambda-coalescents is rather large, and in particular cannot be 
parametrised by a few real parameters. Important examples include A = 5o (Kingman's coa- 
lescent) and A = 6i (leading to star-shaped genealogies, i.e. one huge merger into one single 
block). Later, we will also be concerned with an important parametric subclass of A-coalescents, 
namely the so-called B eta- coales cents, where A has a Beta(2 — a, a)-density for some a G [1,2]. 
Note that such coalescents occur as limits of genealogies of population models, where single 
individuals may occasionally be able to produce almost instantaneously a non-negligible frac- 
tion of the total population size, see e.g. [BB09J for a review. W.l.g. we assume that A([0, 1]) = 1. 

We now introduce detailed notation to describe samples in the infinitely many sites model. 
Note that we represent our data in the form presented in |EG87] resp. [(jT94j . A discussion of 
how to transform actual DNA sequence data into this format can be found e.g. in |BB08|. Sec- 
tion 2.1] (assuming known ancestral types for each segregating site). Although the notation for 
the description of samples in the infinitely many sites model under various equivalence classes 
seems to be relatively standard, we chose to provide full details here, including several formula- 
tions of the recursions for observed type probabilities, since the treatment of the combinatorics 
of samples is somewhat inconsistent across the literature (see, e.g.. Remark 11.21 for some of the 
subtleties). 

We represent a sample of size n by a vector x = (xi, . . . , x„) of n genetic types, where each 
type Xi is given as a list of positive integers representing mutations 

Xi = (xio, . . . ,Xij.) e Z^_^. (1.3) 

Such an x is called a tree if 

1. for fixed i G {1, . . . , n} the coordinates distinct for all j € 

2. whenever for some G {l,...,n}, j, j' € Z+, Xij = Xi'j' holds, then Xij+i = Xi'j'^i 
holds for all / € Z+, 

3. there exist ji, ■ ■ ■ ,jn S Z+ such that xij-^ = X2j2 ■ ■ ■ = Xnj„- 

The space of all trees of size n is denoted by Tn- 

Next, we introduce an equivalence relation '~' on 7^, where two trees x, y G 7^ are said 
to be equivalent if there exists a bijection C,: Z+ — )• Z+ such that y-ij = C,{xij) holds for all 
i € {1, . . . , n} and j € Denote by (Tn/^) the set of equivalence classes under the relation ~ 
and by (7^/~)o the restriction of this set to those classes where ^ Xj if i ^ j. The number of 
segregating sites s is given as the number of different Xij that appear in at least one but not all 
elements in x. Note that this does not depend on the actual representative of the class. Denote 
by (7^,n/~) the set of equivalence classes representing a tree of size n with s segregating sites. 
Note that for simplicity, we will always assume Xij € {0, 1, . . . ,s}. Recall that the complexity 
of a sample of size n with s segregating sites is defined to be n + s — 1. Elements of (Tn/^) are 
called unlahelled trees in |EG871 p. 528, 1. -10]. We will sometimes emphasise the fact that the 
order of the samples (equivalently, of the types in the case of distinct entries) plays a role by 
calling them ordered unlahelled trees. 

A type configuration x = (xi, . . . , x„) G (Tn/^) can be represented by a pair (t, a) of a tree 
t G (7^/~)o of the different types that occur in x and an ordered partition a = (^i , . . . , A^) that 
specify at which position in the sample the corresponding type occurs (i.e. we think of ordered 
types). The number of distinct types is denoted by d = |{xj: i = l,...,n}|. Furthermore, 

= {j '■ U = Xj}, Ai n Aj = (1) Mi ^ j and UiLi ^« = {^^---^n} holds. Note that this 
notation introduces an artificial order of the occurring types. In the sequel the actual sample 



3 



numbers of the types will play no role, but rather the multiplicities. For this purpose define 
nCa) := (1^;^ |j . . . J^^Dj the vector containing the sizes of the sets in a. We denote by 

(t,n)eU^=i(rd/~)oxN'^=:r* 

(where n = n*^**-*) an ordered type configuration with multiplicities. Note that for a given (t,n) 
with d types, there are n!/(ni! • • • n^!) different choices of a consistent with n. 
Finally, we define the equivalence relation '~' by saying that 

(t,a)«(t',a') (1.4) 

holds if there exist a bijection Q: {1, . . . , s} — )• {1, . . . , s} and a permutation a & Sd such that 
Xij = C{xa{i)j) and n^*^) = {n^^^)fj, where x is representative of the class t and a is applied to the 
vector componentwise. Note that under this equivalence class the order of the types is lost. We 
denote such an equivalence class by [t, n] = [t, n^*^-*] and call it an unnumbered unlahelled sample 
configuration with unordered types, sample configuration, or genetree, because it accounts for 
the fact that in a sample obtained from a population, the numbering of the types and mutations 
is artificially imposed. Summarising, in the following we will consider equivalence classes 

(t,a)G(r„/~) and [t, n] G (TJ«). (1.5) 

Note that [t,n] in our notation denotes [^>n(t)] in the notation of |EG87| . with [•] referring to 
the equivalence class under Ki. 

However, one should be warned that there are several combinatorial conventions present in 
the literature, see Remark 11.21 for a discussion of some of the ensuing subtleties. 

Remark 1.1. Note that by ignoring the tree structure given by t and just considering the 
partition n one can map a sample under the infinitely many sites model to a sample in the 
infinitely many alleles model. This observation underlies some of the importance sampling 
schemes discussed below, see Section 12.3.11 However, the additional information provided by 
the infinitely many sites model can be exploited to find more efficient proposal distributions, 
see Section [2.3.21 



1.3 Recursion for tree probabilities 

In this section we recall from |BB08j recursions which allow the computation of the probability 
of observing a given type configuration (t,a). In the sequel, we always think of randomly 
ordered types. 

Indeed, with the above notation, the probability of obtaining a given sample (t,a) from 
the stationary distribution of the population under the infinitely many sites mutation model 
satisfies the recursion 

1 /\AW 
= rn + X ^ f! )KkP(.t,^-ik-l)ei) 

" i:\Ai\>2k=2 ^ ^ 

^^ H,? , ''<^-">'='* (1.0) 



\ ^ P(ti(t),ri(a + ej)) 



rn + A„ d 

x^QuniquG 



with the boundary condition p((0), ({1})) = 1. Here, Xjo unique means that mutation Xjo occurs 
only in type i. The operator s(x) [the operator Si(t)] removes the outmost mutation [from type 
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i] and ti{t) removes the i-th component of the vector t. By a — (k — l)ej we mean a partition 
obtained from a by removing k — 1 elements from the set ai (with implicit renumbering of the 
samples so that the result is a partition of {!,... ,n — k + 1}). Note that by symmetry, the 
type probability p will not depend on the actual choice. Finally, a + ej denotes the partition 
obtained from a by adding an arbitrary element of N to the set Oj that is not yet contained in 
any other set a;, / = 1, . . . , d. 

()1.6p can be seen by conditioning on the most recent event in the coalescent history (or, 
equivalently, in the lookdown-construction into which the so-called 'A-Fleming-Viot process', 
describing the population forwards in time, can be embedded), see |BB081 Section 4] and [S091 
Section 3.3.2] for details and proofs. 

Note that in the 'Kingman-case', i.e. A = 5q, this essentially reduces to the recursion provided 
by Ethier and Griffiths in |EG871 Corollary 4.2] (see also Remark 11.21 below). The relation 
between the sampling probabilities of ordered numbered samples p(t,a) and the probabilities 
of the corresponding unordered unnumbered samples p[t, n] is given by 

p[t,n] =p(t,a)— -^^^ --^ = p{t,TY)-^ (1.7) 

ni!---nd!c(t,n) c(t,n) 

Here, n = n^'*^ n!/(ni! • • • n^;!) is the number of ordered partitions of {1, . . . ,n} into d subsets 
with the given sizes, corresponding to the d types and 



c(t,n) : = 



{^a ^ Sd '■ t ^ and n = rio-Vi} , (1. 



where rio- = (no-(i), . . . ,n-(j{d))- There are d\ possible orders for the types if the mutations carry 
distinct labels, each of which is equivalent to c(t,n) others if mutation labels are disregarded. 
Thus, there are (i!/c(t,n) different re-orderings of the types that cannot be transformed into 
each other by re-labelling the mutations, explaining ()1.7p . 

Note that c(t, n) = c([t, n]) depends in fact only on [t, n]. For a constructive way to evaluate 
c(t,n), see Lemma lA.ll 

Recursion (jl.6p can be combined with relation (|1.7p to obtain a recursion for the unordered 
sampling probabilities p[t,n]: 

1 1 Y^V^/^'^^^ ni-k+lc{t,YY- {k -l)ei) 

" i:ni>2k=2 ^ ^ \ ■• J 

r ^ c(si(t),n)) 

^KT^r_ 22^ (1.9) 

— 1,3; jQ unique, ^ ' ^ ^ 

V- / , ,, c(ti(t),ri(n + ej)) 



~^ Xn + nr . /-^ ^'"-^ ' ^' c(t,n) 



x.^Q unique 

for the sampling probability of the unordered sample [t,n]. Again, we have the boundary 
condition p((0), (1)) = 1. In terms of samples with ordered types (t, n), the recursion reads 

P(t>n)=- — ■ y] , )K,k— — 7——p{t,n-{k-l)ei) 

A„ + nr ^-^ ^-^ \k I n — k + \ ' 

" i:ni>2k=2 ^ ' 

i:rij — 1, a; unique, V / 



"^T-^Jl E E (^, + l)l'(t.(t),r,(n + e,)). 



A„ + nr d 

x^Quniquc 
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with the usual boundary condition. 

Remark 1.2. Note that the recursion given by Ethier and Griffiths in jEG871 Corollary 4.2] 
closely resembles our recursion (jl.6p in the case A = 6q, r = 9/2, up to a missing factor l/d in 
the last term on the right-hand side. This subtle discrepancy can be resolved as follows. 

As before, let (t,a) denote an unlabelled ordered sample of d (ordered) types stored in t 
together with an ordered partition a = {Ai, . . . , A^) and let [t, n] be the corresponding sample 
with d unordered types stored in t and multiplicity vector n = (ni, . . . ,nd)- Recall that we 
have 

n! d\ 

= — -— - — -p{t,a), 

nil ■■■Udl c(t,n) 

where p([t,n]) solves Recursion ()1.6p and p(t,a) solves ()1.9p . In contrast, let (t,a) denote a 
sample with d (unordered) types and type partition a = {^i, . . . ,Ad}, where the types in the 
vector t € (7^/ ~)o are ordered by appearance in the sample (any other deterministic recipe of 
deriving an order on the types from the sample would work equally well). Then, we have 

n\ 1 

Plt.nJ = — -— — -p(t,a), 

ni\---nd\c[t,n) 

which corresponds to (4.12) in [EG87] and is consistent with the displayed equation on p. 86, 
1. -13 of |GT95] . and 

P(t,a) = ^P(t,a). 

If one interprets the notation (T, n) of |EG871 Corollary 4.2] as a canonical representative of 
(t,a), then p(t,a) solves recursion (4.4) in |EG87j without additional factor l/d in front of the 
last term. 

While all the above recursions yield probability weights (resp. likelihoods), for practical 
purposes it is often easier to multiply (jl.Op by c(t, n) and thus derive from ()1.9p a recursion for 
p^[t,n] := c(t, n)p[t, n]. This is the recursion given by |BB081 Corollary 1], and it is also the 
recursion implemented by genetre^ll (for the Kingman case) and MetaGeneTree||. However, 
one should be aware that the p^[t,n] may not always be interpreted as probability weights (for 
example consider the star-shaped tree t = ((1, 0), (2, 0), . . . , (d, 0)) with n = d, n = (1, . . . , 1); 
for d = 22, with A = 6o and r = 7, genetree yields ^"[1,11] « 2.26). Still, this method can be 
used to compute maximum likelihood estimators, and the correct probability can be recovered 
by dividing by c(t,n). 



2 Derivation of importance sampling schemes 
2.1 Simulating sample histories 

In the sequel, we will always parametrise a sample as an unlabelled tree with ordered types (t, n). 
Recursion (|1.10p can be used directly to calculate sampling probabilities for a given sample 
configuration (t,n), noting that the sample complexity is reduced by each step. However, for 
practical purposes this naive approach is only tractable for samples of rather small complexity 
due to the huge number of terms involved (the coefficient matrices of the right-hand sides of 
()1.6p . ()1.9p . resp. (jl.lOp are substochastic, hence numerical stability itself is not an issue). 

One way to deal with this problem is to consider importance sampling using so-called 
(coalescent-) /izstones. Informally, describing samples via ordered types with multiplicities, such 

^Version 9.0, available from http: //www. stats . ox . ac .uk/- griff /sof tware .htini] 
Version 0.1.2, available from http://metagenetree.sourceforge.net 
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a history 

% = {H^r+i, ■ ■ ■ , Hq) 

is the chronologicahy ordered sequence of the r (different) states in T* one observes when 
tracing the coalescent tree with superimposed mutations from the root to its leaves (see, e.g., 
p3B08, Steps (i)--(vii) in Section 3]), where H^r+i = ((0), (1)) is the root and Hq = (t,n) is the 
observed sample. 

A computationally efficient way of generating samples is described in lA.l) adapting [BB081 
Algorithm 1]. Let 6 = (r. A) be the underlying 'parameter' of our model (mutation rate and 
Lambda-coalescent). For a given sample size n, this algorithm constructs the path of a Markov 
chain with law ,„ in T* terminating in a sample of size n. Its transition probabilities are 
given by (as usual, denoting |n'| by n') 

d w.p. if n' = n, 



t',n')^(t",n") 



(t',n' + /e,) w.p. ifn' + /<n(Z>l), 

(a,(t'),n') w.p./; ifn^ = l, 

(ei,,(t'), e,(n' - e,)) w.p. ^^^^ if > 1 (j = 1, . . . , d + 1). 

(2.1) 

Here, (t', n') denotes the current state (with d types), i denotes the type that is involved in the 
transition event with 1 < i < d, and 

The function aj(t') attaches a mutation to the type i. The operator ei,j(t') copies type i, 
attaches a mutation and inserts the resulting type at position j in the vector t. The expression 
e,(n') 

denotes the vector 

ej(n') = tj{n\,.. . ,n^) := (n\, . . . ,n'j_^,l,n'j, . . . ,n'a). 

Note that for given (t', n') and a type i G {1, . . . , d}, it is in principle possible that the values 
(t", n") = (eij(t'), ej(n' — Bj)) are identical for several choices of j. The number of such j equals 

r , , . , nfc = 1 and type k differs from type i \ 
nio(t ,n ,i) := 1 + #n < /c < d : , . , ,. > 

L by exactly one unique mutation J 

('nio' stands for 'number of immediate offspring'). The are the transition rates of the 
time-reversed block counting process of the underlying Lambda-coalescent, see lA.ll Finally, d 
denotes a cemetery state. Once reached, the sample has been generated and is given by the 
penultimate state (t',n') (from which the cemetery state had been reached). 

It is straightforward to read off the transition probabilities from (j2.ip . observe in particular 
that 

r nio(t',n',i) , 



,4He = {t",n")\He-i = {t',n')) 



Vn' d+1 



if (t", n") = (cj J- (f) , Zj (n' — e^)) . (No such ambiguities arise for the transitions in the first three 
lines of (O).) ' 



We have 

p{t,n)= Yl ^e,nm, (2-2) 

W:Ho=(t,n) 
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were the sum extends over all different histories (of possibly different lengths) with terminal 
state Hq = (t,n). Recursion (jl.lOp is just a way to enumerate all consistent histories and 
compute the sum in (|2.2p . An obvious 'naive' approach to estimating p{t, n) is via direct Monte 
Carlo: Indeed, 

1 

MZ]^{(W«)o=(t,n)}> (2-3) 
1=1 

where are independent samples from Pg_„(-), is an unbiased estimator of p(t, n). 

Unfortunately, even for small sample sizes, the variance of ()2.3p is typically too high for ()2.3p to 
be of practical value, since p(t, n) can easily be of the order 10~^^ (see Table [2] for examples). 



2.2 Importance sampling and the optimal proposal distribution 

Importance sampling is a well-known approach to reducing the variance of estimators of the form 
(|2.3p . In the following, we will think of a fixed sample size n and will thus lighten notation by 
denoting ¥g := Pg^n- Consider a proposal distribution Qei-) on the space of histories satisfying 



{Ho=(t,n)} 



and use it to rewrite equation (j2.2p as 



This shows that 



H:Ho = (t,n) 
i=l 1=1 

where 'H^^\ . . . M^'' are independent samples from Qe{-)., is also an unbiased (and consistent 
as M — > oo) estimator of p(t, n). Denote by 



win 



g(7^W) if(7^»)„ = (t,n) ^^^^^ 



otherwise 



the importance sampling weight or IS weight of the history 'H^'^\ Our goal now is to derive 
proposal distributions for which the variance of the estimator (j2.6p is small. 

The optimal proposal distribution Qg, for which this variance vanishes, is given by 

Ql{'H)=¥e{n\Ho = {t,n)], (2.8) 

the conditional distribution on the histories given the observed data. Under Qg, the importance 
weight w{'H) equals p(t, n) for all histories T-L compatible with the data. Hence, the (consistent) 
estimator (|2.6p becomes deterministic, and its variance is thus zero. 

Since for a given ^e^H) is straightforward to evaluate, we see from (|2.8p that explicit 
knowledge of the optimal proposal distribution is equivalent to knowing p(t,n), so not surpris- 
ingly Qq cannot be given explicitly in general. This also applies to the Kingman case except 
for so-called parent-independent mutation models, as observed in [SDOOj . 

It is natural to consider proposal distributions Qq under which the time-reversal of the 
history is a Markov chain starting from the observed configuration (t, n) and ending at the root 
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((0), (1)), thus guaranteeing that the weights in (|2.7p are strictly positive. Indeed, by elementary 
properties of Markov chains, Qg has this property. 
Let 





G(")(t',n')=Ee,„[ Yl l{(t',n')}(^^)] (2.9) 



-—T + l 

denote the associated Green function, that is the expected time the Markov chain with transition 
probabilities (j2.ip (for samples of size n) spends in the state (t',n'). Note that by the special 
structure of the transitions in (|2.1|) which increase the sample complexity in each step, we in 
fact have (for n > \n'\) 

GW(t',n')=Pe,n{3^ :F, = (t',n')}. (2.10) 
Lemma 2.1. For (t,n) with |n| = n we have 

p(t,n) = G'W(t,n) ^" (2.11) 

nr + q)i' 

More generally, for (t',n') with |n'| = n' < n, 

, , G(")(t',n') 
Pit', n') = \ (2.12) 

g{n,n')[n'r + q)^, ') 

where g{n, n') is the Green function of the block counting process of the underlying Lambda- 
coalescent, see \A..2\ . 

Proof. (|2.1ip follows from ()2.10p and the fact that under '^e,n when the chain is currently in a 
state with sample size n it terminates with probability (gl"'')/(nr + gl"^), see the second case 
in Step 2 of Algorithm 1 in lA.li 

We see from (|A.6p and (|2.ip that the probabilities for transitions between states with at 
most n' samples agree under Pg^^ and P^.n/ except for terms involving q^^ ^ resp. q^'f' . Using 
(jA.Sp on the product of these terms yields 

P,,„{3£ : H, = (t',n')} = 4^1Pe,n'{3^ : H, = (t',n')}. (2.13) 

g{n,n') 

Using ()2.10p . ()2.1ip and observing g{n',n') = l/{—qn'n') = l/^i" \ we obtain 

p{t',n')=Fe,n'{^i ■■He = it',n')] 



G("')(t>')^n__^ 

/I -{"■) 

Q{n)n^, ^,. g{n,n) q^, 

g{n, n') ^'r + g^?'^ 
G(")(t',n') 



g{n,n'){n'r + gfj? ^) 

which is (I2J2I1 . 



□ 



9 



Lemma 2.2. The time-reversed history % under Qg is a Markov chain started in Hq = (t,n) 
with transition probabilities given by 

Ql{He.i = {t',n')\He = {t",n")) 
G(")(t',n') ^ 



G{")(t",n" 



He = {t",n")\He-i = {t',n')), (2.14) 



where the transition matrix under Pg ^ is described in Section \2. 1\ and n = |n|. The chain is 
absorbed in the root ((0), (1)). The transition probability in ()2.14p is independent of n (provided 
n > |n"|;. 

Sketch of proof. The optimal proposal distribution is the distribution of histories simulated with 
Algorithm 1 in lA.ll with transition (j2.ip conditioned on observing (t, n) as the penultimate state 
before hitting the 'cemetery' d. Nagasawa's formula can thus be applied to obtain the transition 
probabilities (j2.14p of the time-reversed chain (see e.g. [RW87j . Sect. III. 42). □ 

The fact that ()2.14p does not depend on the target sample size n stems from the consistency 
properties of Lambda-coalescents and is made explicit in the following Remark 1 2. 3 [ 

Remark 2.3. By Lemma l2.H we may express the transition probabilities of the time-reversed 
history under Qq explicitly via p as follows (with notation as above): 



Q;(i/,„i = (t',n') |i/, = (t",n")) 

Kt',n'; 



p(t",n") 



^1ir^i(ln",n"-i if (f, n') = (t", n" - lei), 

^ if (t',n') = (s,,(t"),n"), (2.15) 

n' 

f^^^^^^K + 1) if (t',n') = (r.(t"),r.(n" + e,)), 

otherwise. 



Proof. For the last three lines in the right-hand side of (f2T3D note that by ([232]) , G^'^'^ (f, n') /G^") (t" 
p(t', n')/p(t", n") if |n'| = |n"|, for the first line observe 

g(n,nO(nV + g^7^) J_ n'l - I _(„) 
g{n,n" j(n"r + q^„ ) ' n' "' '' 

I n'l -I g{n, n')q^;^Li^^„ _n'l-l qn»,n"-l 



iin' = n" -I {see ^EB)- □ 

Remark 2.4. Lemma 12.21 can be seen as a starting point for importance sampling: Any (quite 
possibly heuristic) approximation of G*-"'^(t", n")/G^"^(t', n') leads via ()2.14p to an approxima- 
tion of Qg which can be used as a proposal distribution. This is the 'A-coalescent equivalent' of 
Stephens & Donnelly's [SDnni Thm. 1] observat ion that the optimal distribution in the King- 
man context can be characterised in terms of the conditional distribution of an (n + l)-st sample 
given the types of n samples. 

2.3 Importance Sampling Schemes 

We have shown that the optimal proposal distribution Qg{-) is a Markov chain and derived 
expressions for the transition probabilities in terms of the Green function ()2.9p . Since recursive 
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evaluation of the Green function is equivalent to evaluating the likelihood, this is more of 
theoretical than direct practical value. 

Still, in the remaining sections we will present several proposal distributions based on Markov 
chains that approximate the optimal proposal distribution in reasonable ways so that the vari- 
ance of the estimator (|2.6p is small. We discuss separately situations in which the proposal 
distribution does not take any information about 'genealogical distance' between types (which 
is in principle provided by the IMS model) into account, and situations in which at least some 
of this information is retained. 

2.3.1 Importance sampling schemes without regard of 'genealogical distance' be- 
tw^een types 



Griffiths & Tavare's scheme for Lambda- coalescents. Griffiths and Tavare in |GT94j introduced 
a Monte Carlo method to estimate the likelihoods of mutation rates under Kingman's coalescent. 
This method was generalised in |BB08) to the multiple merger case and can be interpreted, as 
observed by Felsenstein et. al. |F99j . also as an importance sampling scheme. 

Indeed, it is easy to derive a proposal distribution from recursion (jl.lOp . recovering the 
scheme derived in |BB08j . For a given configuration (t,n) with complexity greater than 1 (i.e. 
excluding the root), define (with the usual convention n = |n|,r„ = A„ + nr) 



Uj-k + l 

i:n^ — 1, a; unique, 



\ i:ni>2k=2 ^ ^ 

+ ^ E E K + l))' (2-16) 

a;^Q unique 

and put feiiO), (1)) := 1 for the root. 

Definition 2.5 (Proposal distribution Qg'^"^). We denote by Qq''^'^ the law of a Markov chain 
on the space of histories with transitions, given a state (t,n), as follows: 

(t,n)- 



(Sj(t), n) W.p. ■r^j^'lt.n) ^ unique Sj(ti)) ^ tjVj, 

(rj(t),ri(n + ej)) w.p. r!jl^t]l) '-^ = ^'*''0 unique, s{ti) = tj, (2-17) 

jt,n-(fc-l)e,) w.p. ^-^^(^)A„,fc^fg^ «/2<fc<n.. 



To see why this yields a suitable Monte Carlo estimate, let r denote the random number 
of steps that our Markov chain performs until it hits the root configuration. Then, a simple 
calculation shows (see e.g. |BB081 Lemma 6]) that we may write 

T-1 

p(t,n)=E(t,„)n/e(^i), (2.18) 

1=0 

where the expectation is taken with respect to Qq''^'^ started in (t,n). 

Remark 2.6. This Monte Carlo method can be interpreted as an importance sampling scheme 
by choosing the proposal weights w{'H) according to 

^=0 
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Note that this method is a special case of general Monte Carlo methods for systems of linear 
equations with non-negative coefficients. It is therefore referred to as the 'canonical candidate' 
by [GT97] in the Kingman case and will serve us as a benchmark in Section [31 

Stephens & Donnelly's scheme for Lambda- coalescents. Stephens and Donnelly [SDOOj motivate 
and study a proposal distribution in a general finitely many alleles model under Kingman's 
coalescent. One can efficiently sample from their proposal distribution by choosing an individual 
from the current sample uniformly at random and then decide on the transition for the type 
of this individual. This is indeed optimal in the case of parent-independent mutations (see 
[SDOOj . Prop. 1). This procedure is adapted by Stephens and Donnelly to the infinitely many 
sites model in their Section 5.5. Here, not all types are eligible for a transition - only those 
whose multiplicity is at least two (which will then merge) or whose outmost mutation, say Xko, 
is unique. Denote the number of eligible individuals of a configuration (t,n) by 



z(t,n):= Yl 



m. 



iirij — 1 and x^q unique 

or n,- >2 



Under Kingman's coalescent, choosing (uniformly) an eligible individual is equivalent to propos- 
ing a transition step. Either a singleton is chosen, where the only possible most recent event is 
the removal of the outmost mutation, or an individual with a type that occurs at least twice in 
the sample is chosen leading to a binary merger. 

To adapt this approach to the Lambda-case note that when choosing an eligible singleton 
type the proposed step is unambiguous as in the previous case. However, the proposal needs 
additional information if a type with multiplicity greater than two is chosen, since then typically 
various multiple mergers of ancestral lines can occur. A natural approach to this problem is to 
choose the size of a merger with a probability proportional to the rates of the block counting 
process of the A-coalescent (see e.g. [BB08[ Section 7]). Based on this idea we introduce the 
following proposal distribution. 

Definition 2.7 (Proposal distribution Q^"^^). The proposal distribution Qg~^^ is the distri- 
bution of the Markov chain on the space of histories performing the transitions 



(t,n) 



(Sfc(t), n) w.p. z{t,n) ^-f ^ '■ ^ ^'^kO unique Sfc(xfe)) ^ Xj Vj 

{tk(t,Xk{n-\- Bj)) W.p. jjj-^ifk: nk = l,Xko unique (2-19) 

'(k)ni 



(t,n- (/c - l)ei) w.p. 4fei/2<fc<n„ 



where 



p(fc)=pf'")(fc)= (2.20) 

1^1=2 Qn,n-l+l 

for Ui > 2 is the probability derived from the block counting process (see \A.l]) that in the most 
recent merging event k lineages coalesce. 

Remark 2.8 (On optimality in the infinite alleles model). Hobolth et. al. showed in [HUWOS] 
that the proposal distribution of Stephens and Donnelly in the Kingman case is the optimal 
proposal distribution in the infinitely many alleles model (IMA), which is the prime example of 
a parent-independent mutation model. A crucial step in the proof is Ewens' sampling formula, 
which provides an explicit expression for the probability of a sample in the IMA. Since such an 
explicit formula is (at present) not available in the IMA for Lambda-coalescents, we may express 
the optimal proposal distribution only implicitly via a recursion of Mohle [M06j . Indeed, let 
c = (ci, . . . , Cfc, 0, . . . ) G (No)°° denote an allelic partition of a sample in the IMA, that is, Cj 
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is the number of types that occur i times in the sample. Then, the sampUng probabiUty q{c) 
satisfies 

q{c) = —q{c - ei + E E ■ ^ ^ + " ei+j 2.21 

with n = ^ - ici. The boundary condition is g((l, 0, ...)) = 1 and we set q{c) = if any entry 
in c is negative. Further, let (p : {t,n) c be the function which maps a sample (t,n) in the 
infinitely many sites model, which we think of being generated by Algorithm 1 (the 'A-Ethier 
Griffiths Urn'), to the corresponding allelic partition c in the infinite alleles model (i.e. where 
Ci = #{types k with Uk = = 1,2, .. . ). Let p^-^'^^ be the image measure of the sample 
distribution under cj). Then, using conditional probabilities, the optimal sampling distribution 
Q*,iMA ^j^^ infinite alleles model has transitions 

Q*'™^(c'|c) = p^-^^^{c\c')^, (2.22) 

where c' = c — ei or c' = c + ej — ej+i for some i,j € {1, . . . , n — 1} are the only possible 
transitions. Unfortunately, unlike the Kingman case (where the Ewens sampling formula is at 
hand), there is no explicit closed solution to the recursion ()2.2ip . However, for a given sample 
size, the solution to ()2.2ip could be precomputed and stored in a large database (this is much 
easier than in the case of our original recursion for (t,n) since no explicit type configurations 
t need to be stored). This would yield a perfect sampler (given a suitable database) for the 
infinite alleles model in the A-case. Still, since a lot of information is lost via our map (j), it 
is unclear if this would lead to a good sampler for the infinitely many sites case. We refer to 
|M06j and |DGP06 j for a more thorough investigation of the infinitely many alleles model in 
the A-coalescent case. 

2.3.2 Schemes based on compressed genetrees 

In the following, we will abbreviate the transition matrix of the time-reversed history under Qg 
by 

Q*ei{t",n") ^ (t',n')) := Q^(i?,-i = (t',n') | He = (t",n")) 

for any (t', n'), (t", n") G T*, which is well-defined irrespective of the 'target' sample size n 
appearing in fg^n (see Lemma l2.2p . 

In this section, our goal is to derive proposal distributions for the infinitely many sites model, 
where at least partial information about the structure of the type configuration t is retained. 

In the simplest case the idea (due to |HUW08] ) is to subsequently focus on a single mutation 
in the genetree and then to consider the corresponding "compressed" genetree, in which this 
is indeed the only mutation at all. For such a simple compressed tree, the optimal transition 
probabilities can be computed explicitly (at least numerically). Summing over the mutations, 
these probabilities are then composed to a proposal for the original tree. 

This approach will be explained and extended to the Lambda-coalescent in the next subsec- 
tion. After that, we will show how to extend this framework to retain more information about 
the tree, in particular taking pairs of mutations (and potentially even more) into account. 

Hobolth, Uyenoyama & Wiuf 's Scheme for Lambda-coalescents. Let (t, n) be a sample 
with ordered types. Since we will consider individual mutations, for the purposes of this section, 
we think of a fixed representative under the mutation relabelling relation ~ from Section 11.21 

Pick a segregating site, say s' G {1, . . . , s}. We first introduce the 'compressed genetree' of 
[t, n] with regard to the mutation at the segregating site s'. Denote by 
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(2) n — d 



O 



n 



d 



(a) The sample configuration [^A^] 



(b) Tiie sample configuration 
[Mo]-. 



Figure 1: The sample configurations [A^^Jsi and [7Wq]~. The sample has one segregating site 
respectively no segregating site. 



d(s') = d(s', (t,n)) = m 



i : type i carries 
a mutation at s 



the number of individuals in the sample bearing a mutation at the segregating site s'. Let 

>[3 := (((0),(l,0)),(n-d,d)), nGN, {0,...,n}, (2.23) 
be the genetree where d individuals bear a mutation and n — d do not. Note that 

A^^=(((0)),(n)) (2.24) 



is the configuration where all n individuals share the same type. See Figure [TJ 

Definition 2.9 (compressed genetree). Let (t,n) be a sample of size n with s > 1 segregating 
sites. Let s' £ {l,...,s}. Then, we define the 'compressed genetree' (t,n)(s') with respect to 
the segregating site s' as 



(t, n)(.') := = (((0), (1, 0)) , {n - d{s'), d{s')) ^ 

where d{s') is the number of individuals carrying mutation s' in the sample. 

We now explain how to derive from the optimal proposal distribution for the corresponding 
compressed trees a proposal distribution for the original genetree. To this end, fix s' E {1, . . . , s} 
and let po{n, d) be the probability that the most recent mutation in (t, n) affected an individual 
out of the d = d{s') individuals exhibiting a mutation at segregating site s', that is 

„(n.<i) = (^-2;(A.3^Air;) if<'>l. p.25) 

In the first case the most recent event was a merger of any size involving individuals bearing the 
mutation, whereas in the second case the last event was the origin of the mutation. Further, 
define 

ipein.ds')^ if type i carries a mutation at s' 

ul'\i) := {^"^ ' "'^s' (2.26) 
1^(1 — pe{n, t^s')) n-d / type i does not carry a mutation at s . 

Note that ni/d is the fraction of genes of type i among those genes carrying a mutation at 
segregating site s', thus u^g \i) would be the exact probability that the most recent event in 
the history involves type i if s' were the only segregating site. 
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Definition 2.10 (Eligible types). Let (t,n) be a genetree with d types. We say that the k-th 
type, where k € {1, . . . ,d}, is eligible for transition (or short: eligible if either > 2 or 
nfc = 1 and x^o is unique. 

We are now ready to state a Lambda-coalescent extension of the |HUW08] proposal distri- 
bution (for A = Sq, it agrees with that from |HUW08"] ). 

Definition 2.11 (Proposal distribution Q^"^^^^). We denote by Q^-^uw^ ^/^g i^^^ of a Markov 
chain on the space of (time-reversed) histories Ti, starting from samples of size n, if its transition 
probabilities from a state (t',n') can be described as follows: 

• Pick a type, say k, from the set of eligible types of (t', n') at probability 

T.l'=A'\k)^ 

Ylli eligible X]s' = l (^)- 

• // the multiplicity of the chosen type k is one remove the outmost mutation. 

• If the multiplicity n'^ is larger than one, perform a merger inside this group. The size of 
the merger is determined as follows: 

— If type k does not bear a mutation, then, an I + 1 merger, for \ < I < Uk, hap- 
pens with probability proportional to QI{Mq Mq^''), where QI{Mq — > Mq) = 
g{n,n')qn'i/G^'^\A4Q ) is the probability of jumping to the terminal state. 

— If type k bears at least one mutation, let s' be the segregating site corresponding to its 
outmost mutation XkQ. Let d{s') be the number of individuals in the sample bearing 
a mutation at this segregating site. Then, an I + 1 merger, for 1 < / < n^, happens 
with probability proportional to Q*Q{M.'^f^^,-^ — -^^(^)_;)- 

Remark 2.12. (i) The quantities pe{n, d) and the proposal of the merging size involve the optimal 
proposal distribution for samples with at most one segregating site so these quantities can be 
easily computed numerically and kept in a lookup table. 

(ii) Hobolth et. al. showed in |HUW08"1 Theorem 2] that if the sample is of size 2, then the optimal 
proposal distribution chooses one of the two types proportional to the number of mutations it 
differs from the root of the genetree. They note in |HUW08"| Remark 3] that their proposal 
distribution equals the optimal one in that case. The same statement is true for Q^-HUW ^ g^j^ce 
the dynamics of a sample of size two does depend on A only through the total mass. For more 
general samples this effect should also favour types that have a large number of mutations. 

Figure [2] depicts a sample configuration and all corresponding compressed genetrees with 
one mutation. Hobolth et. al. provide in [HUW08 ] explicit formulae for the optimal transition 
probabilities for samples with just one visible mutation if the underlying genealogy is given by 
Kingman's coalescent. 

Schemes regarding Pairs of Mutations We now extend the approach of |HUW08] to con- 
sider compressed genetrees which allow two mutations. First note that there are two kinds of 
structurally distinct genetrees with two mutations. 

Definition 2.13. Let 

Ml^d, ■■= ([(0),(l,0),(2,0)]^,(n-di-d2,di,d2)) (2.27) 
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.0) 



1 2 2 1 3 3 



,4) 2 (5) 4 



carrying: 
(o) 4 = n-di 



^ 8 = di 
© 8 

i ^(5) 4 



not carrying: 
^(o) 11 i ^(o) 9 



® 1 (3) 3 

i ^(o) 10 



@ 2 



Figure 2: A sample configuration is depicted on the left and a type i is marked. On the right 
all possible compressed genetrees are listed. The type corresponding to i is marked in the 
compressed genetrees. Either type i corresponds to the type carrying the mutation or not. 




&)di-d2 



®d2 

(a) The genetree for the sam- (b) The genetree for the sam- 

ple configuration [A^dj^.djjsi pie configuration [A/J^.dal- 

Figure 3: The two different sample configurations of size n with two segregating sites (or 
mutations), di individuals carry mutation i, i = 1,2. 



be the genetree where the two mutations are on different branches. The number of individuals 
carrying mutation m is dm- Denote by 

^1^2 ■= ([(0),(l,0),(2,l,0)]^,(n-di,di-d2,d2)) (2.28) 

the sample configuration where the mutations are on the same branch. The number of individuals 
carrying only mutation 1 is di — d2 and both mutations are carried by d2 individuals. 

The two possible types of genetrees are depicted in Figure [3l 
Remark 2.14. (i) Note that M.^^ = di (^^ equivalence classes under r^) holds for di+d2 < 
n. Furthermore, note that with di + d2 = n, Mq^^ and A/j^^^ denote valid genetrees with 

two mutations (even though for the latter two, mutation 1 is then not segregating), whereas by 
a slight abuse of notation q = M-Qdi — ^"^^ M^^ q = TW^i denote genetrees with only 
one segregating site. We denote by M.qq = Mqq = M.q the sample of size n with no segregating 
sites. 

We now introduce the notion of a compressed genetree with regard to pairs of mutations. 

Definition 2.15 (compressed genetree). Let [t,n] be a genetree with s > 2 segregating sites. 
Let s', s" € {1, . . . , s}. Then, we denote the 'compressed genetree' with respect to the segregating 
sites s',.s" by [t, n](s', s"), where 

[t,n](/,.") = A^rf(.0,d(.") 
if there is no type in [t,n] which carries mutations at both s' and s" , and 

[t,n](.',.")=AA^(.0,d(.") 



16 




i ^# 1 




1 



2 



® 2 



®2 



(a) The two mutations considered in this 
example are mutation 1 and 8. The com- 
pressed genetree is of the form A/J^ ^^'^ 



(b) In this example mutation 3 and 5 are con- 
sidered. The compressed genetree is of the form 
A^2i,d2 ^^'^ type i corresponds to the root type. 



type i corresponds to mutation 1. 



Figure 4: Two examples of genetree compressions. The type i is identified with one of the three 
types in the compressed sample by this procedure. 

if there is at least one type in [t,n] which carries mutations at both s' and s" , and there is no 
type which carries a mutation at s" hut not at s' . 

To consider pairs of mutations determining the probabilities of performing a step involving 
type /c in a general sample configuration (t,n) it is necessary to know the relation of the 
outmost mutation x^o of type k (if it carries a mutation at all) to the given pair of mutations 
(resp. the corresponding segregating sites) in the genetree. In other words, the type in the 
compressed genetree corresponding to type k needs to be determined. Based on this information 
the appropriate most recent event in the history of the compressed tree can be chosen. Figure H] 
shows two examples of compressed genetrees for two given segregating sites. By symmetry, this 
relation can be described by one of five distinct cases. 

Definition 2.16. Let [t,n] he a genetree with s > 2 segregating sites and let k G {l,--- ,d}. 

Let s', s" G {1, . . . , s} he two segregating sites. Then, we distinguish the following cases: 

Case I if type k hears mutations at s' and s" , 

Case II if type k hears a mutation at s' , hut not at s" , 



The five cases are depicted in Figure O 

Again, we will now derive proposal distributions based on optimal proposals for the com- 
pressed trees. To this end, note that the optimal transition probabilities for samples with two 
mutations can be calculated numerically. To determine the transitions of the proposal Markov 
chain until it hits the root configuration corresponding to TWj the transition probabilities for 
the cases with one mutation or zero mutations also have to be precomputed. Thus, we shall set 
the probability weights for transitions involving samples with at most two mutations equal to 
the optimal weights in all the proposal distributions defined below (at no extra computational 
cost). 

Fix a sample [t, n] with d different types and at least two segregating sites s', s" . Note that 
a possible transition of the proposal Markov chain can be characterised by a pair {i, I) with 



Case III 



and there exists a type carrying hoth mutations, 
if type k hears a mutation at s' , but not at s" , 



Case V 



Case IV 



and there exists no type carrying both mutations, 
if type k does not bear any mutation at s' or s" , 
and there exists a type carrying hoth mutations, 
if type k does not hear any mutation at s' or s" , 
there exists no type carrying hoth mutations. 
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® n - d{s') @n- d{s') 



I n - d{s') 



Q) d(s') - dis") d{s') - d{s") ^ n - d{s') - d{s") dis') - d{s") 



p d{s") 
(a) Case I 




d{s") 

(b) Case II (c) Case III 

in- d{.s') - d{s") 



d{s") 
(d) Case IV 




d{s") 
(e) Case V 



Figure 5: The five different cases of types being affected by the most recent event in the two 
genetrees corresponding to configurations Al^^^,^ ^^^„-) (Case III and V) and A^2(s') d{s") (Case 
I, II and IV). The shaded node refers to the proposed type. 



1 < i < d and < / < nj — 1, where i denotes the type that is involved in the most recent event 
and / denotes the amount by which the multipHcity is decreased. Denote by Z = the case that 
the outmost mutation of type i is removed from the genetree (if type i is an ehgible singleton). 
Now define, for I > 1, the quantity 



d{s") Q*eV^d{s' 



ui'''"'(M) 



d{s')4{s") ~' ^d(s')~l,d{s")~l) 
d{s')-d{s") ^ev'^d{s'),d{s") ^ ■^^d{s')-l,d{s")' 



d(?y^6»V-""-d(s'),d(s") 
n-d{s')^e\-'^ d{s')4{s") 



s')-l,d{s") 
■^^ d(s'),d{s")' 



in Case I, 
in Case II, 
in Case III, 
in Case IV, 



{ n-d{s'Uis") Q*eiM''dis')As") ^ ^diJ)As")^ Case V. 



For Z = let 



d{s")Qei-^d{s'),i 


d(Fy2e(-^d(s"),i 






^d(s'),o)^W«")=i} 



in Case I, 
in Case II, 



in Case IV, 
in Case V. 



(2.29) 



(2.30) 



Note that in Case I, III and IV the order of the mutations in the compressed tree -^^{s') d{s") 

determined by their order in the original genetree. Analogous to (|2.26|) . Ug* would be 

the optimal probability weight of transition if only the two mutations s' and s" existed in 
the data. 

Finally, we define for each type i,l < i < d, and segregating sites s', s", 



E 
E 
E 
E 
E 



1=0 



)-l,W,s"} 



(i,/) 

(s')-d(s")-l 
1=0 



hi) 



(s')-l {s',s"} 

1=0 '^e 

n-d{s')^l {s',s"} 

1=0 

n-dis')--dis")-l {s',s"} 
1=0 ^9 ' 



in Case I, 
in Case II, 
in Case III, 
in Case IV, 
i, I) in Case V. 



(2.31) 
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We will now use these quantities as probability weights for the event that in the compressed 
genetree [t, n](s', s"), the last event in the history involved type k (under our new proposal 
distributions). 

Definition 2.17 (Probability weights for picking eligible types). Given [t,n] with d types and 
s >2 segregating sites, let, for each eligible k £ {1, . . . d}. 



Qli[t,n]){k) :-- 



2^{i<s'<s"<s}'^e \'^) 



j=l /-/{l<s'<s"<s} V') 

If k is not eligible, put (5g([t, n])(A;) = 0. 

This distribution can be used to propose a type to be involved in the most recent event. 
In a second step one may then choose the size of the possible merger, similar as before in that 
again the probabilities of the merger sizes in a specific sample, now with two mutations, are 
considered. 

Definition 2.18 (Proposal distribution Q^"^^^^"). We define a distribution Q^-HUW^a 
space of histories % as the law of a Markov chain with transitions as follows. Let [t,n] be a 
sample configuration with at least s >2 segregating sites and d > 1 types. 

• Choose a type i € {1, . . . ,d} to be involved in the most recent event in history according 
to Ql{[t,n]){k) from Definition \2T} . 



^/'^i = 1) remove the outmost mutation of type i (noting that a.s. only eligible types can 
be chosen). 

If > 2, and i bears at least one mutation, let s' be the segregating site corresponding 
to the outmost mutation of type i. Reduce the multiplicity of type i by I with probability 
Qei-^d{s'),d{s')-ni ~^ ■^d{s')-l,d(s')-ri.)- V ^VP^ ^ ^VP^' rcducc the multiplicity by 

I with probability QKMn-m ^ -^n^LnJ- 
Alternatively, one may consider all mutations present in the sample. 

Definition 2.19 (Proposal distribution Q^'^^^ We define a distribution Q^~^^^ ^ on the 
space of histories % as the law of a Markov chain with transitions as follows. Let [t,n] be a 
sample configuration with at least s > 2 segregating sites and d>\ types. 

• Choose a type i € {1, . . . ,d} to be involved in the most recent event in history according 
to Ql{[t,r\]){k) from Definition \2.11 . 

• If Ui = 1, remove the outmost mutation of type i (noting that a.s. only eligible types can 
be chosen). 

• If ni> 2, choose to decrease ui by I with probability proportional to 

Y.uY''"\i,l\ (2.32) 

s',s" 

where the sum extends over all pairs of segregating sites present in the current sample. 

It might appear artificial to consider choosing transition by such a two-step procedure instead 
of choosing all at once. Indeed, the method of [HUWOS] can be extended in another direction 
by choosing the type involved in the most recent event and the size of the possible merger in 
one step. We present two proposal distributions that let pairs of mutations valuate all possible 
transitions and then the most recent step is chosen proportionally to these weights. 
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Figure 6: Type II has multiplicity one and is a descendant of type I. Thus all pairs of mutations 
that do not include the outmost mutation of type II weigh the step removing the outmost 
mutation of type two with zero. The table on the right shows that Q^-HUW A -g g^Qggp ^j^g 
optimal distribution than qA-huw^b 



Definition 2.20 (Proposal Distribution 



We define a distribution 



A-HUW^B 



on 



the space of histories H as the law of a Markov chain with transitions as follows. Let [t,n] be a 
sample configuration with at least s >2 segregating sites and d > 2 types. We propose the event 
{i, I) for 1 < i < d and < Z < rij — 1 to be the most recent evolutionary event with probability 
proportional to 

{J2{s',s"}^f' ^hl) 'if i is eligible 
otherwise. 



(2.33) 



Note that in a given sample (t,n), by ()2.30p the contribution of the presence of a pair 
of mutations at {s', s"} to the event (z, 0) of removing the outmost mutation of a leaf type i 
is zero if the corresponding d{s') resp. d{s") is greater than one. In a generic genetree this 
case appears rather frequently and thus we argue that the proposal distribution qA-huw b 
underrates mutation events. This effect is illustrated in Figured 

To circumvent this problem, one may modify the proposal distribution by summing only 
over those pairs of mutations where one of the mutations coincides with the outmost mutation 
of the current type. This should reduce the number of pairs that put too much emphasis on 
the merging events and establish a more balanced proposal distribution. 

Definition 2.21 (Proposal Distribution Q^-huw^a^^ define a distribution Qj-HUW^A 
the space of histories % as the law of a Markov chain with transitions as follows. Let [t,n] be a 
sample configuration with at least s >2 segregating sites and d >2 types. We propose the event 
{i, I) for 1 < i < d and < Z < rij — 1 to be the most recent evolutionary event with probability 
proportional to 



eligible and the root type 
if i is eligible 
otherwise, 



(2.34) 



where is the segregating site corresponding to the outmost mutation Xio of type i. 

Remark 2.22. (i) Another positive side effect of this method is that it reduces the complexity 
of proposing a step from quadratic to linear in the number of mutations. 

(ii) In Hobolth &: Wiuf |HW09j . Section 4, explicit expressions for the sampling probabilities in 
the case of Kingman's coalescent for samples with two (nested) segregating sites are presented. 
The authors note in Section 7 that their results 'could potentially be used to further improve 
the proposal distribution for inference in coalescent models.' Indeed, via Remark 12.31 their 
results can be applied to derive explicit formulae for the quantities u^*"'* \i, I) that govern the 
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proposal distributions regarding pairs of mutations (for A = Sq). 

(iii) Note that the idea to let mutations valuate all possible transitions can also be applied for 
the case when just one mutation at a time is considered in the sense of Q^-^uw ^ 

Our last proposal distribution combines the single-mutation approach with the pair ap- 
proach. Indeed, note that the complexity of the proposal distributions regarding pairs of mu- 
tations is quadratic in the number of mutations, whereas the proposal distributions regarding 
all mutations have linear complexity. We will see in Section [3] that the real-time to compute 
steps for the distributions differ. However, we find that the method determining the size of the 
merger in proposal distribution Q^-huw a fj-Qj^ Definition 12. 181 performs well. Thus a promising 
candidate concerning speed and performance should be given by the combination of proposing 
a type in the first step considering all mutations separately and then choosing the merging size 
in the second step by the method from Q^-HUW 

Definition 2.23 (Proposal Distribution Qj-^'^^^ ''). We define a distribution Qj-^^w^ '' on 
the space of histories % as the law of a Markov chain with transitions as follows. Let (t, n) be 
a sample configuration with at least s > 2 segregating sites and d > 2 types. Choose type i to be 
involved in the most recent event considering all mutations according to the same method used 
for the distribution Q^-^uw j^q^ Definition If a singleton type is chosen, remove the 

outmost mutation, whereas in the case of a non-singleton type i with rij > 2 the multiplicity is 
decreased by I with probability Qe{^d{o),d{o)-n, ^ K{o)-i,d(o)-nJ' ^'^^^^ ^<l<ni-l. 

Remark 2.24. For the analysis of a sample of size n, the proposal schemes from (|2.18t I2.19[ 
12.201 12.211 12.23P all require the numerical computation of the solution of (jl.lOp for all samples 
of size m < n with at most two segregating sites. This can be precomputed, but should be kept 
in the computer's main memory during the (many) repeated runs. Thus, memory requirements 
can be a limiting factor prohibiting the analysis of large samples. 

Since in a sample of size m < n with at most two mutations under the IMS there are at 
most three types (of several possible multiplicities), memory of the order will be required. 
For further speed-up one could also store the transition probabilities for all possible moves for 
each sample, which would result in a requirement of the order n^. 



3 Performance Comparison 

In this section we investigate and compare the performance of the different proposal distribu- 
tions, introduced in Section [2.31 iii various scenarios by means of a (not necessarily comprehen- 
sive) simulation study. 

Such a study faces two particular issues which need to be addressed. First, one needs to 
identify (preferably parametric) sub-families of Lambda-coalescents which might be of biological 
relevance (i.e. arise from microscopic modelling of the behaviour of the underlying population). 
We will focus our attention to so-called Beta- coales cents, recalled below. A second issue is owed 
to the fact that tractable sample complexities are still in the low three-digit numbers (~ 100). 
If one wishes to compare the performance of our sampling schemes one has to use either a few 
less generic scenarios where the samples have relatively large complexities or many samples of 
small complexity 

This section can be outlined as follows: First, we introduce and discuss the class of Beta- 
coalescents. Then, we measure empirically the total variation distance between our proposal 
distributions and the optimal distribution for a small sample complexity. Next, we compare the 
concrete performance of our schemes for several randomly generated samples of small size for 
various scenarios, and for several relatively large real DNA sequence data samples. Finally, we 
will discuss our results and try to come up with recommendations for the practitioner. 
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r = 0.5 


r = 1 


r = 2 




a = 1 


a = 1.5 


a = 2 


a = 1 


a = 1.5 


a = 2 


a = 1 


a = 1.5 


a = 2 




0.166 


0.118 


0.080 


0.172 


0.134 


0.088 


0.127 


0.114 


0.084 


, — 


0.226 


0.114 


0.060 


0.220 


0.142 


0.088 


0.151 


0.115 


0.084 


2 


0.115 


0.077 


0.045 


0.119 


0.102 


0.074 


0.083 


0.082 


0.071 


2 


0.069 


0.058 


0.039 


0.088 


0.096 


0.084 


0.068 


0.082 


0.091 


2 


0.054 


0.047 


0.038 


0.064 


0.065 


0.063 


0.053 


0.055 


0.060 


2 


0.063 


0.044 


0.026 


0.081 


0.072 


0.053 


0.060 


0.062 


0.055 


^A-HUW' a 



0.058 


0.041 


0.026 


0.076 


0.069 


0.053 


0.058 


0.059 


0.055 


oA-HUW-'-' 


0.092 


0.063 


0.038 


0.111 


0.097 


0.071 


0.081 


0.081 


0.071 



Table 1: Total variation distance between optimal proposal distribution and importance sampling 
schemes, averaged over all samples of complexity 15. 



3.1 Beta-coalescents 

Recall that our 'parameter' 9 = (r, A) consists of the mutation rate r and the underlying 
Lambda-coalescent with coalescent measure A. The case where A = is the classical Kingman 
case describing populations with constant population size and reproduction events which are 
small when compared to the total population size. Here, we will consider the case where 
A = B(2 — a,a), with a £ (0,2), that is, so-called 'Beta-coalescents' introduced by |S03j . whose 
density is given by 

A{dx) = ^^^1 x'-^il - dx. 
i (z — a)L [a) 

Note that the Kingman-coalescent corresponds to the weak limit as a — >■ 2. See, e.g., |S03j or 
[BB09] and the references there for a discussion of possible biological motivations of this class. 



3.2 Distance to the optimal proposal distribution 

For small sample complexities, it is possible to solve our recursions (|1.6p . (jl.Op and (jl.lOp nu- 
merically and hence to compute optimal proposal weights directly. It is therefore natural to 
measure the distance between the optimal proposal distribution and our candidate distributions 
for such small complexities. We consider a selection of parameter values for the Beta-coalescent 
(including the Kingman-coalescent) in Table [1] and present the total variation distance of the 
optimal weights of the possible steps and the weights given by the candidate distribution av- 
eraged over all possible samples of complexity 15. In enumerating all these samples, viewed as 
trees, we have found algorithms from |K05j very helpful. 

The relative ranking of the different candidates implied by the total variation distance is 
similar when using the mean-squared distance or the relative entropy (data not shown). The 
respective minimisers are printed in bold. 

The best results are consistently provided by methods based on compressed genetrees with 
two mutations, namely Q^-HUW^A^ qA-huw^o ^^^^ Qg~^^^ ^ . This is true not only for the 
Beta-coalescent, but in particular for Kingman's coalescent, so that our new methods seem 
to outperform even the classical methods known so far, at least with respect to this rather 
theoretical criterion. 



3.3 Performance comparison for different specific tree structures 

In this subsection, we aim to investigate strengths and weaknesses of our methods depending 
on the structure of the genetrees encoded by the datasets. 
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n 1 2 l| I 1 1 1 3 9 I |4 1 1 3 1 4 1 

(a)r = 0.5, a = 2 (b) r = 0.5, a = 1 (c) r = 1.5, a = 1.5 



Figure 7: Trees showing an average number of mutations out of 500 simulated trees under the 
respective parameters (leaf labels correspond to type multiplicities). 
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(a) r — 0.5, a — 1 



4 2 2 3 1 1 1 1 
(b) r = 1.5, a = 2 



111 



12 111 



(c) r — 1.5, a — 1 



Figure 8: Trees showing a number of mutations that equals the empirical 80% quantile of 500 
simulated trees under the respective parameters. 

To this end, we simulated 500 genetrees under given parameters (for Beta-coalescents) of 
sample size 15. Note that the corresponding tree complexities vary and can be much bigger than 
15. From these 500 trees, we a) uniformly pick one tree with an 'average' number of mutations 
(note that the distribution of the number of mutations can easily be computed recursively) and 
b) choose a tree with a number of mutations according to the empirical 80% quantile of the 
500 simulated trees (i.e. a tree with 'many' mutations). . Sample trees chosen according to 
other criteria of 'atypically high sample complexity' yielded similar results to those from case b 
(data not shown). The computations were carried out using MetaGeneTree on computers with 
a standard performance (using AMD Opteron CPUs with 2.6 GHz). 

We begin with a), an average tree (with respect to number of mutations, for the given pa- 
rameters), and investigate the performance of our methods for three different parameter values. 
Figure \7\ shows the genetrees and the respective parameters used for its generation. Figure [9] 
shows the respective number of runs and computing time needed so that the relative empirical 
error of the likelihood estimate becomes smaller than 1%. Again, our proposal distributions 
based on compressed genetrees fare rather well, with the notable exception of Q^-huw b_ 

b) Our next set of genetrees corresponds to the 80% quantile with respect to the num- 
ber of mutations on the tree (i.e. trees with an exceptionally large number of mutations, and 
therefore relatively high tree complexity). Figure [8] shows the genetrees and the respective 
parameters and Figure [10] gives the number of runs and computing time needed so that the 
relative empirical error of the likelihood estimate becomes smaller than 1%. As expected, the 
average computational time, due to increased complexity, increases significantly in comparison 
to an 'average' tree. The relative performance of our methods, however, remains similar - in 
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(a) Base- 10 logarithm ol the number of runs (b) Base- 10 logarithm of the computing time 
needed (log(#)). needed in seconds (log(t)). 



Figure 9: Number of runs and computing time needed to obtain a relative error below 1% for the 
'average trees' given in Figure [71 

particular, Q^-HUW^A pgi-forms best. 

3.4 Average performance over many samples 

We simulated 100 samples under a given pair of parameters and estimated the likelihood of these 
samples for the same parameters. Whereas for the analysis in the previous section we provided 
the exact number of runs, we now cumulated additional simulation runs until the relative error 
dropped below 1%, increasing the number of new runs by a factor of 4 in each step. Density 



plots for the number of runs needed to achieve this are given in Figure 11(a) for the parameters 



(1, 1.5) and in Figure 12(a) for the parameters (1, 2) for selected proposal distributions. 

As before, we also measured the time required to achieve a relative error below 1% in term 
of the actual computing time in seconds. The base-10 logarithms of the corresponding times are 



given in Figure 11(b) and Figure 12(b) for selected proposal distributions. Since one simulated 
sample for a = 2 showed no mutations, we assumed a duration of zero. For a = 1.5 (Figure [TT]l 
the proposal distribution Q^-HUW A g^gg^jj^ performs better than the others. However, for a = 2 
(Figure [T2]) performances are very similar with even a slight disadvantage for Q^-HUW^A 
terms of computing time. 

3.5 Performance on real data sets 

So far we have only dealt with simulated datasets of relatively small complexity. We now analyse 
the performance of our methods on various real datasets. 

We begin with a famous and well-studied dataset consisting of mitochondrial data sampled 
by Ward et. al. ([WFDP9T]) from the North American Nuu Chah Nulth tiihe. The corresponding 
genetree is given in Figure [T3j These samples were analysed in a framework similar to ours in 
|GT94j and |HUW08j . and we use the data in the form edited by Griffiths and Tavare in [GT941 
Figure 3]. We first estimated the maximum likelihood values for the mutation rate r and the 
parameter for the Beta-coalescent a on a discrete grid. The values are given in Tabled Details 
of this method and possible biological implications will be discussed elsewhere. We then used 
the estimated parameters to perform the same analysis as in Section 13.31 ttiat is we determined 
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(a) Base- 10 logarithm of the number of (b) Base- 10 logarithm of the computing time 
runs needed (log(#)). needed in seconds (log(t)). 



Figure 10: Number of runs and computing time needed to obtain a relative error below 1% for 
the trees of high relative complexity given in Figure [HI 
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Figure 11: Empirical distributions for the number of runs and the real-time for 100 samples 
of size 15, simulated with r = 1 and a = 1.5. The likehhood was computed for the same 
parameters. 
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log(t) 
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Figure 12: Empirical distributions for the number of runs and the real-time for 100 samples of 
size 15 with, simulated with r = 1 and a = 2. Again, the likelihood was computed for the same 
parameters. 

the number of independent runs and the computing time to estimate the likelihood value at this 
point in the parameter space with a relative error below 1%. The result is given in Figure [TU] 
(cf. the symbol related to |GT94| ). Again the proposal distributions using pairs of mutations 
show good performance when the number of runs is considered. However, this advantage almost 
vanishes when the total computation time is considered. Still, Q^"^^^ ^ performs best. 

Currently, evolutionary mechanisms to describe actual biological populations which might 
give rise to Lambda-coalescent like genealogies (see e.g. |EW06j ) are being discussed. In this 
subsection we will further compare the performance of our methods on the datasets consid- 
ered in |A04j . namely mitochondrial cytochrome h DNA variation data sampled from various 
subpopulations of Atlantic Cod (Gadus Morhua). These datasets, depicted in Figure [2] and 
Figure are taken from [AP96j , |APP98j (only from the Baltic transition area) , [APKSOO] 
(only the Greenland subsample), j(M91j . !PC93j and [SA03] (only cyt b data). 

Again, we estimated the maximum likelihood values for the mutation rate r and the param- 
eter for the Beta-coalescent a on a discrete grid and proceeded in a similar way as for the Nuu 
Chah Nulth data. The estimated parameter values are given in Table [2] and Figure [ripl and 
Figure [T7] show the results of the runtime analysis. 

Again the proposal distributions using pairs of mutations show a strong performance when 
the number of runs is considered. However, this advantage vanishes when the computation 
time is considered, where for some samples Q^~^^ and Q^-^uw gyen perform better. To some 
extend this can be attributed to the increased effort the proposal distributions using pairs of 
mutations have to invest in the precalculation. 

3.6 Conclusion and guidelines for the practitioner 

Table[T]shows that for a wide range of parameters and samples with complexity 15, the proposal 
distributions using pairs of mutations are typically closer to the optimal proposal distribution 

^The analysis for [GT94] under qA-huw b gj^^^g^j ^ relative error of 7 % after 27 million runs taking 32 days. 
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Figure 13: The genetree corresponding to the dataset from |GT94j . 
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(a) [AP96] 



(b) [APP98] 



(c) [APKSOO] 



Figure 14: The genetrees corresponding to the datasets from |AP96] . [APP98j and [APKSOO] . 
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(a) [CM91] 



(b) [PC93] 



(c) [SA03] 



Figure 15: The genetrees corresponding to the datasets from |CM91] . |PC93j and [SAOSj . 
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(a) Base- 10 logarithm ol the number of runs (b) Base- 10 logarithm of the computing time 
needed (log(#)). needed in seconds (log(t)). 



Figure 16: Number of runs and computing time needed to obtain a relative error below 1% for the 
genetrees corresponding to the datasets from |GT94| , [AP96) and |APP98| given in Figure [13] and [Ml 
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(a) Base- 10 logarithm of the number of runs (b) Base- 10 logarithm of the computing time 
needed (log(#)). needed in seconds (log(t)). 



Figure 17: Number of runs and computing time needed to obtain a relative error below 1% for the 
genetrees corresponding to the datasets from |APKSOO| . |CM91| . |PC93] and [SAOSj given in Figure 1141 
and[I3 
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GT9.1 


AP96 


APP98 


APKSOO 


CM91 


PC93 


SA03 


n 


55 


100 


109 


78 


55 


103 


74 


(f, a) 


(2.4, 2.0) 


(0.7, 1.65) 


(0.6, 1.55) 


(0.7, 1.65) 


(0.8, 1.4) 


(0.6, 1.4) 


(0.7, 1.3) 


P"[t,n] 


9.02 ■ 10"^" 


2.25 ■ 10"^^ 


2.19 ■ 10-^* 


2.26 ■ 10"^^ 


3.80 ■ 10"^ 


1.64 ■ 10"^" 


6.44 • 10-" 


c(t, n) 


1 


2 


2 


6 


6 


4 


96 


p[t,n] 


9.02 ■ 10-^" 


1.13 ■ 10-" 


1.10 ■ lO-^"* 


3.77 ■ 10-" 


6.33 ■ 10-"J 


4.10 ■ 10-^^ 


6.71 ■ 10-" 



Table 2: True probabilities p[t,n] under estimated ML parameters (within the Beta(2 — a,a)- 
class; MLE on a discrete grid) combinatorial factors c(t,n), and likelihoods p[t,n] for the real 
datasets. 



than the proposal distributions using less detailed information from the sample. The distribu- 
tion Q^~^^ performs better then the 'standard' Qq''^'^ , but is in turn outperformed by Q^-^uw ^ 
This relative ranking of distributions in principle holds throughout the subsequent analysis in 
Sections 13.31 13.41 and 13.51 The proposal distributions using pairs of mutations perform consis- 
tently better than the others when the number of independent runs is considered. Note that 
qA-huw b -g g^j^ exception to this rule (this fits to the observation from p. [20] that Q^"^^^ ^ 
underrates mutation events). 

However, when considering overall computation time, this clear picture changes. Though the 
methods using compressed genetrees still outperform Q^'^"^ and Qg"^^ in most cases, Q^-huw 
shows a performance comparable to our methods using pairs of mutations. On the one hand 
this can be attributed to the actual implementation, on the other hand the computational com- 
plexities per proposal step for different proposal distributions do differ, ranging from constant 
{Qg~^^ and Qq~^^) to linear {Qg~^^^ and Q^"^uw A-j quadratic in the number of segregat- 
ing sites. When real computation time is considered, the proposal distribution Q^-HUW^ seems 
to make up for the lack in accuracy by smaller computation time for each step when compared 
to the pair-wise methods. 

A further increase in the runtime of the proposal distributions regarding pairs of mutations 
needs to be attributed to the fact that they require precalculation of all steps in all samples with 
up to two segregating sites. When analysing the more complex real data sets of the previous 
section, this precalculation becomes a substantial component of the total computing time. For 
example, our current implementation needed about 38 seconds for the precalculation of samples 
of size 50, but this rapidly increases to 4250 seconds for samples of size 100. In contrast, 
the proposal distribution Q^-huw Q^\y requires precalculations for samples with one or zero 
segregating sites, which is negligible for samples of size 100. 

Still, if a sample configuration can be analysed by the proposal distribution Q^-^uw 
then this proposal distribution yields a good performance. Furthermore, when several datasets 
are to be analysed, the program MetaGeneTree allows to save computing time by storing the 
precalculated optimal proposal weights in a file. 

In conclusion one can say that the methods using compressed genetrees present an im- 
provement over the 'canonical candidate' Qq'^"^ or the heuristic generalisation of Stephens and 
Donnelly's idea for the Lambda-case, Q^'^^. For small to moderate sizes the pair- wise methods 
perform rather well with Q^-HUW^A outperforming every other method. 

In general, which proposal distribution works best in terms of real-time requirements depends 
on the particular data set and the parameters. Thus, for larger datasets, we recommend a small 
preparatory study to test the performance of the various methods. This can easily be done with 
MetaGeneTree. 
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Appendix 



A.l Generating samples: Details 



The following is adapted from jBBOSl Section 7]. Let {nt}t>o be a A- 
coalescent. We denote by {Yt}t>o the corresponding block counting process, 
i.e. Yt = 7^ {blocks of 11^} is a continuous-time Markov chain on N with 
jump rates 



The total jump rate while in i is of course —qa = Yl]Ji Qij- We write 

K, := ^ (A.1) 

for the jump probabilities of the skeleton chain, noting that {pij) is a 
stochastic matrix. Note that in order to reduce i classes to j classes, an 
i — j + 1-merger has to occur. Let 



g{n, m) := 



"OO 



forn > m > 2 (A.2) 



be the expected amount of time that Y , starting from n, spends in m. 
Decomposing according to the first jump of Y , we find the following set of 
equations for g{n,m): 

n-l 

g{n,m) = ^pnkg{k,m), n > m > 2, (A. 3) 

k=m 

g{m,m) = , m > 2. (A-.4) 

Let us write F*^"^ for the process starting from Yq^^ = n. Let r := inf{t : 
y/"^ = 1} be the time required to come down to only one class, and let 

be the time-reversed path, where we define Y^^^ = d, some cemetery state, 
when t > r. 

With the above definitions, Y^^^ is a continuous-time Markov chain on 
{2, . . . , n} U {d} with jump rates 
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and ql^g = —qnn^ where g{n^ m) is as in (IA.2p . The starting distribution of 
y(") ig given by 

Ft{Y^''^ = k}=gin,k)qku 

for each k. We write |n| := X^^^i^^j, and denote q^^^ := —qj^i^- 
Note that 

Qk^ = -to, 2 < /c < n, (A.6) 

i.e., the total jump rate of y^"^ in state k < n does not depend on n. 
( \AM follows from the observation that by monotonicity of paths, the set 
of times that Y'^^^ (and thus y(")) spends in a given state k is a.s. an 
interval (possibly empty), thus 



1 



E 



E 



1 



-Qkk 



F{3s : Yi^) = k} P{3s : yW = A;} 



~(n) 



because the hitting probability and the length of the time interval spent 
in k are the same for the path y*-") and its time-reversal. 

Let (y/"'')£=o,i,2,... be the skeleton chain of the time-reversed block count- 
ing process. We parametrise time for F^") in such a way that Yq^^ = 1 
and y/"^ = Y^''\ Thus, is a Markov chain on {l,2,...,n} U {d} 
with transition matrix p^^j} = g{n^k)qki {2 < k < n), p^J^^ = q^^^/qj^^ 



[2<j<i< n), p 



in) 



1 



_(n) 



The time-reversed block counting processes corresponding to different 
'target' sample sizes are related as follows: For ni < n2 and any Iq = 1 < 
£i < • • • < £to < ni, we have 



'{y/"^^ = ^.,^ = o,...,m} = 



5^(^1,4 



'{y/"^^ = ^.,^ = o,...,m}. 



in particular, for £ < rii < n2, 

5/(^2, £)P{y("i) hits £} = i/(ni,^)P{y("^) hits £}. 
To see (IA.7P note that foi 1 = £o < ■ ■ ■ < £m ^ n 



(A.7) 
(A.S) 



m— 1 



£/(n,fi)g4,i J]^ gJJ^^^ = g{n,£i)qi^,i ]J = i'l^'^^ 



m— 1 



i=0 
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dividing both sides by YULi^ Qi^^ = YY^Ll^{-Qi^e^) gives 

m— 1 m—2 / £ ^ Tn—\ 

The law of the sequence {Zq := ([(0)]_, ({1})), Zi, . . . , Z^) generated by 
Algorithm lAJJ is that of the sample histories described in Section [2]T1 Note 



that it agrees with jBBOSi Algorithm 1] except for the way the ordering of 
the types is generated. 

Algorithm A.l Algorithm to generate a sample under the A-coalescent in the infinitely many 
sites model. 

1) Draw K according to the law of Yq^\ i.e. Y'i{K = /c} = g{n, fc)'/fci- Begin with a single 
'ancestral type' with multiplicity K, i.e. t = (xi),xi = 0,n = (K), and so d = 1. Set 
s := 1. 

c:=l, := (t,(K)). 

2) Given = (t,n) with d types, let k := |n|, and draw a uniform random variable U on 
[0,1]. 

o If [/ < — 1 then draw one type, say /, according to the present frequencies. 

kr+q^. 

- If n/ = 1, Zc+i arises from Zc by replacing x/ by (s, xjq, . . . , xjj^-^). Increase s 
by 1. 

- If n/ > 1, Zc+i arises from Zc as follows: Copy Zc, decreasing nj by one. Then 
define a new type x' = (s, xjq, . . . , xjj^-^), draw J uniformly from {1, . . . , d + 1} 
and insert x' with multiplicity one into Zc+i just before the previous type J (with 
the convention that the new type is placed at the end of Zc+i when J = d + \). 
Increase s and d each by one. 

o If [/ > — them 

kr+q)_ 

- If |n| = n, stop. 

- Otherwise, pick J € {fc + l, . . . , n} with Pr{ J = j} = j/^#n- Copy Zc+i from 
Zc- Choose one of the present types / (according to their present frequency), 
and add J — |n| copies of this type, i.e. replace rii := rii + J — |n| in Zc+i- 

3) Increase c by one, repeat 2). 



A. 2 A discussion of the combinatorial factor c(t, n) appearing in (11.71) 

Let t, a, n^*^^ = n, and thus also the sample size n = |n|, the number of 
segregating sites s and the number of different types d visible in the sample 
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be given. We evaluate c(t,n) more explicitly, using ideas from Griffiths 



Recall that an unordered unlabelled sample configuration with unordered 
types [t, n] is equivalent to a non-planted rooted unlabelled graph-theoretic 
tree r with n leaves and s + 1 internal vertices (a rooted graph-theoretic 
tree is called planted if the root node has degree one and non-planted 
otherwise), see |G87l Theorem 1]. In this parametrisation, the leaves of 
r = r([t,n]) correspond to the (unnumbered) samples, the internal nodes 
to segregating sites (except for the root of r) and types to internal nodes 
with at least one subtended leaf. By contrast, a given (t, n) with d ordered 
types can be viewed as such a tree in which the d internal nodes with at 
least one subtended leaf carry distinct numbers from {1, . . . namely 
the type numbers. 

The basic observation behind the following lemma is that removing the 
root node (and connecting edges) from a rooted tree leaves a number of 
(possibly planted) rooted trees that can be grouped into classes of isomor- 
phic trees. 

Lemma A.l. Order the types in [t,n] in some arbitrary fashion, yielding 
(t,n). Let the root of r = r([t,n]) have k > descendants, < i < k 
of which are leaves. Group the subtrees founded by the descendants which 
are not leaves into isomorphy classes (isomorphy as rooted trees). Write 
r for the number of non-leaf classes and (/i, . . . , for their sizes (in some 
arbitrary ordering). Necessarily gi + ■ ■ ■ + gr = k — i. Call representatives 
of the r different classes ti, . . . , Tr. There are 



permutations of the type numbers that do not change r, with the empty 
product interpreted as 1, and c(t,n) is defined in (11.81 ). 

Proof. We prove the statement by induction on the number of nodes in 
r (equivalently, the sample complexity). For a tree with 3 nodes, corre- 
sponding to a sample of size 2 with no mutations, Equation (IA.9p yields 
the correct answer 1. 

Now consider r, where the root has k—^ non-leaf descendants in r classes 
of sizes 5^1, ... , gr. For each i = 1, . . . , r there are c{Ti) ways to permute 
the type names without changing Ti (viewed as an unnumbered unlabelled 



r 




(A.9) 



35 



(a) Exchanging type I and II (b) Exchanging type I and II does not alter the 
does not alter the genetree. graph-theoretic tree. 



Figure 18: The effect tliat reordering does not change the tree visualised in both corresponding 
representations. 

sample with ordered types). Since there are gi representatives of this class 
attached to the root, this yields c{Tiy^ possibihties. Additionally, we can 
interchange the complete set of type names between the subtrees in class 
giving another factor gi\. Since the type name changes in a given class 
do not affect the changes in the other classes, the factors from each class 
have to be multiplied to obtain the result. □ 



Remark A. 2. (1) See Figure 18(a), 18(b) for two representations of 



(t,n)= (((2,1,0),(3,1,0),(4,0)),(2,2,3)) 
which has c(t, n) = 2. 

(2) When implementing the recursion (lA.Op on a computer, one obviously 
has to compute isomorphy classes of subtrees of a given tree. There, we 
have found it useful to pass to planar representatives of the given graph- 
theoretic rooted trees and implement a total order on such trees (for which 
there are various possibilities). 



A. 3 Speed-up: Precomputations and multiple parameter sets 

Assume that for some A C T*, pe{t' ^y\!) is (numerically) known for all 
(t', n') G A. In practice, this can be achieved by including in A only such 
samples for which (11.101 ) can be solved numerically on the given computer 
architecture. 

This information can be combined with importance sampling schemes 
as discussed above by running the proposal chains only until they hit A^ 
thus reducing the variance of the estimators: Let % = {Hi) := (H^i) be 
the time-reversed history, (t,n) G T with |n| = n be given and let Q 
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be a proposal distribution (compatible with (12.41 )) under which [Hi] is a 
Markov chain, starting from Hq = (t, n). Then we have 



Xr 



rn + Xr 



TT- 

1=0 Qi^i ^i+l 



(A.IO) 



where r4 := min{? : Hi G A] and \Htj\ denotes the number of samples in 
Hrj^. Analogous to (12.61) . by averaging the term inside the Q-expectation in 
(lA.lOp over independent draws from Q, this yields an unbiased estimator 
of ]36i(t, n) whose variance will be smaller than that of (12.61 ). 

For given (t, n) = /ig, hi, . . . ,hs G T* with hi ^ A, i = 0,1, s — 1, 
hg G A, we have 



ln{{H^s, H s+1, ■ ■ ■ , Hq) = {hs, . . . , ho)) 

s-1 

= fe,n{n hits h,) ( W P,,.(/i,+l ^ 



Xr 



rn + Xr 



by the Markov property under Pg* thus (lA.lOp follows from (12.101 ). Lemma [2]T 
and the Markov property under Q. 

Note that (12.61 ) and the analogous estimator built from (lA.lOp can be 
used to simultaneously estimate po{t,n) for various values of 6 from the 
same runs under a given Q (of course, yielding correlated estimators). This 
can be computationally more efficient for example when computing likeli- 
hood surfaces. See, e.g., |TZ04] . Sect. 6.3 on how to combine estimators 
from different runs. 



A. 4 Estimating times and aspects of the genealogy given the data 

The time-reversed history (Hi) = (H^i) describes the skeleton chain of 
a(n ?7,-)A-coalescent with mutations according to the IMS model. It is 
straightforward to augment this with 'real times' (on the coalescent time 
scale): Given H = {Hq, . . . , Hr-i), the coalescent process will spend time 
Vi in the i-ih state, where the Vi are conditionally independent with 
C{V,\H) = Exp(r|^,| + A|^^|), thus := Vo + -- • + the time of the i-th 

event, can be readily simulated given Ti. Furthermore, for any function 
f(^{Hi), (Ti)) of the reversed history and its (coalescent) time embedding. 
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we have 



(A.ii; 



= E 



Q 



L Q{n) 



for any proposal distribution Q satisfying (12.41 ) , where imphcitly, the con- 
ditional law of (Tj) given H = {Hi) is the same under Q and under P^/^. 
Thus, in analogy with (12.61 ). 



M 



(A.12) 



is an unbiased and consistent estimator of (lA.lip . where 'H^^\ . . . 

and the corresponding {T^^^)^ . . . , (T^^^^) are independently drawn from Q. 

For example, using f{{hi),{U)) = h + ■ ■ ■ + U-i or f (ijii) , {U)) = 
l(ti + - • •+^T-i < x)^ combined with an estimate of ]?6'(t,n), this approach 
can be used to estimate the conditional mean or even the conditional 
distribution of the time to the most recent ancestor of the sample, given 
the observed data. Similarly, the conditional age of a particular mutation 
can be estimated (when undoing the equivalence relation ~). This extends 
the line of thought from |GT94] to the Lambda-coalescent context. 
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